Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import transactions by quarter, in chunks of a few hundred #212

Merged
merged 10 commits into from
Sep 23, 2024

Conversation

hancush
Copy link
Member

@hancush hancush commented Sep 18, 2024

Overview

The transaction import makes something like 10 database calls per iteration. That can really slow us down when our database connection and/or cloud compute environment isn't zippy (#210).

This PR reduces the number of iterations from n to roughly n/4 by slicing the import by quarter and reduces the number of database calls (made from transaction saves, other queries are unaffected) from n to n/500 by batching transaction saves in chunks of 500. Both of these options are configurable, but are given reasonable defaults.

It further reduces run time by importing only one transaction type per job, rather than both contributions and expenditures. Jobs are now created per a matrix strategy, so that all transaction imports run concurrently.

Connects #210.

Notes

I initially did a month filter, but filing periods tend to correspond to quarters, so some months have no filings. I thought it better to do quarters for jobs of roughly equal size, but with no waste jobs (i.e., jobs that don't import anything).

The quarter and year parameters are a little confusing.

When filing period spans years, its transactions can occur across two data files. The year is the vintage of the data file, not the transaction date or filing period, and is used to remove only transactions from a given filing in the given year (i.e., ones that should be reimported in a given run).

The quarter is used to filter filings to only those with a filing period beginning in the given quarter of any year. In this way, it accounts for filings spanning years. For example, consider a filing period starting in December 2023 and ending in February 2024. Transactions would be split across the 2023 and 2024 files. To get them all, you would run the Q4 import for both 2023 and 2024.

Testing Instructions

@hancush hancush temporarily deployed to openness-pro-hcg-batch--pqadyk September 18, 2024 17:34 Inactive
@hancush hancush temporarily deployed to openness-pro-hcg-batch--d5evuv September 18, 2024 20:35 Inactive
@hancush hancush temporarily deployed to openness-pro-hcg-batch--d5evuv September 19, 2024 19:45 Inactive
@hancush hancush changed the title Add month filter to transaction import Import transactions by quarter, in chunks of a few hundred Sep 19, 2024
Comment on lines +19 to +20
.SECONDEXPANSION:
import/% : _data/sorted/$$(word 1, $$(subst _, , $$*))_$$(word 3, $$(subst _, , $$*)).csv
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parse the transaction type and year out of a pattern like CON_1_2023. One transaction file covers the entire year, so we don't need to download it again for each quarterly import.

Comment on lines +5 to +7
define quarterly_target
$(foreach YEAR,$(1),$(patsubst %,import/$(2)_%_$(YEAR),1 2 3 4))
endef
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create a target for each year ($1), transaction type ($2), and quarter.

@hancush hancush temporarily deployed to openness-pro-hcg-batch--d5evuv September 19, 2024 20:00 Inactive
@hancush hancush temporarily deployed to openness-pro-hcg-batch--d5evuv September 19, 2024 20:05 Inactive
@hancush hancush temporarily deployed to openness-pro-hcg-batch--d5evuv September 19, 2024 20:13 Inactive
@hancush hancush temporarily deployed to openness-pro-hcg-batch--d5evuv September 19, 2024 22:00 Inactive
@hancush hancush temporarily deployed to openness-pro-hcg-batch--d5evuv September 19, 2024 22:03 Inactive
@hancush hancush temporarily deployed to openness-pro-hcg-batch--d5evuv September 19, 2024 22:05 Inactive
@hancush hancush marked this pull request as ready for review September 23, 2024 13:28
with:
ref: "deploy"
- name: Import data for 2024
ref: "hcg/batch-it-up"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Change back to deploy before merge.

Copy link
Member

@fgregg fgregg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good. one bug fix, and some suggestions.

"""
return filter(
lambda x: get_quarter(x[0][2]) in filing_quarters,
groupby(tqdm(records), key=filing_key),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this tqdm is a bit confusing, as it is over all the records, not just the filtered records.

would something like

return groupby(tqdm(filter(records, lambda x: ...)), key=filing_key)

work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup! Done.


if not len(batch) % batch_size:
self._save_batch(batch)
batch = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you also need to handle the case that you have iterated through all the records, but the batch isn't modulo the batch_size

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch, thanks.

contribution = self.make_contribution(record, None, filing)
batch.append(contribution)

if not len(batch) % batch_size:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when doing modulo, i think it's better to have the form that

len(batch) % batch_size == 0

i think it's just a touch more explicit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it. Done.

@hancush hancush requested a review from fgregg September 23, 2024 18:27
@hancush hancush temporarily deployed to openness-pro-hcg-batch--d5evuv September 23, 2024 18:27 Inactive
Copy link
Member

@fgregg fgregg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work!

Copy link
Contributor

@antidipyramid antidipyramid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very curious to see what the performance improvement looks like!

):
yield cls.objects.bulk_create(cls_records)

def import_contributions(self, f, quarters, year, batch_size):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such a clean function.

@hancush hancush merged commit 25ba297 into main Sep 23, 2024
2 checks passed
@hancush hancush deleted the hcg/batch-it-up branch September 23, 2024 18:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants